In sports analytics, there is ongoing research on whether a player’s market price is too high. Attributes such as age, position, games and minutes played, and player efficiency rating all factor into this analysis. This project explores a dataset of NBA players’ salaries and their season statistics for the 2016-2017 season. The aim is to build a regression model that predicts a player’s salary, which can then help judge whether the actual salary is reasonable.
A regression model predicts the value of a Y variable in the dataset from known values of one or more X variables. The prediction is based on:
Y: the dependent (response) variable
X: the independent (predictor) variables
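The idea of predicting Y from known X values can be sketched on a small example. The data below are hypothetical toy values, not taken from the NBA dataset, and the names `toy`, `fit` and `new_pred` are illustrative only.

```r
# Toy data (hypothetical values, not from the NBA dataset)
toy <- data.frame(
  ppg    = c(5, 8, 12, 15, 20, 25),           # X: points per game
  salary = c(2.0, 3.1, 5.2, 6.0, 8.1, 9.9)    # Y: salary in millions of USD
)

# Fit Y (salary) on X (ppg) with ordinary least squares
fit <- lm(salary ~ ppg, data = toy)

# Intercept and slope of the fitted line
coef(fit)

# Predicted mean salary for a known X value
new_pred <- predict(fit, newdata = data.frame(ppg = 10))
new_pred
```

The prediction is simply the intercept plus the slope times the new X value, which is what `predict` evaluates.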
Good predictions require choosing good attributes, and which attributes are suitable depends on the type of variables used in the model. This matters because the chosen factors can affect the predicted salary significantly.
The coefficients in the regression equation describe the relationship between each independent variable and the dependent variable. Entering values for the independent variables into the equation gives the predicted mean value of the dependent variable.
This is needed for generating unbiased predictions. Predictions are precise when the observed values cluster close to the predicted values. To check this, we frame the question as a hypothesis test.
The NBA dataset is a good example for this prediction task because it provides relevant quantitative variables such as Age, PER, FPG, G, TS, and AST, as well as qualitative variables such as Pos and Tm.
The dataset will be checked statistically to determine whether the assumptions are met and whether points scored per game (PPG) affects the prediction significantly, i.e. the study tests whether mean PPG differs and whether it is a factor that significantly affects salary.
All statistical analyses will be performed using R (software version 4.1.2).
Null hypothesis (H0): Mean PPG has no impact on salary
\[ H_0: \mu_1 = \mu_2 = \mu_3 \]
Alternative hypothesis (Ha): At least one difference in mean PPG impacts salary
\[ H_A: \text{some } \mu_i \ne \mu_j \]
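One way to read a hypothesis of the form \(\mu_1 = \mu_2 = \mu_3\) is as a one-way ANOVA on three groups. The sketch below assumes, purely for illustration, that players are split into three salary tiers and that mean PPG is compared across them; the tiers, the toy values and the names `toy`, `fit` and `p_value` are all hypothetical and are not part of the original analysis.

```r
# Hypothetical toy data: PPG for players in three assumed salary tiers
toy <- data.frame(
  tier = factor(rep(c("low", "mid", "high"), each = 5)),
  ppg  = c(3, 4, 5, 6, 5,        # low-salary tier
           8, 9, 10, 9, 11,      # mid-salary tier
           18, 20, 22, 19, 21)   # high-salary tier
)

# One-way ANOVA: H0 says the three tier means are equal
fit <- aov(ppg ~ tier, data = toy)

# Extract the p-value of the F test for the tier effect
p_value <- summary(fit)[[1]][["Pr(>F)"]][1]
p_value
```

A small p-value would lead to rejecting H0 in favour of at least one tier mean being different.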
The approach for designing our model will be based on our dataset and the variables that we choose.
The objective of our study is to build a regression model and evaluate it based on our NBA dataset variables and player’s season_stats dataset.
Designing the method for prediction analysis requires selection of variables based on their correlation strength with the dependent variable.
Following are the packages required with their use:
tidyverse = Allows for data manipulation and works in harmony with other packages as well
plotly = Interactive graphical representation in R
rstatix = performing statistical tests
data.table = Dataframe Enhancement Tool
corrplot = Generation of Correlation plots
PerformanceAnalytics = Performance and Prediction Analytics Package
GGally = A ggplot2 extension
install.packages("rsconnect")
Plotting First DataSet
Plotting Second Dataset
Added Columns
Obtaining the types of Data available in season_salary dataset.
This dataset contains 573 rows and four columns: X (player_id), Player (name), Tm (team), and season17_18 (season salary).
## 'data.frame': 573 obs. of 4 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Player : chr "Stephen Curry" "LeBron James" "Paul Millsap" "Gordon Hayward" ...
## $ Tm : chr "GSW" "CLE" "DEN" "BOS" ...
## $ season17_18: num 34682550 33285709 31269231 29727900 29512900 ...
Obtaining the types of Data available in player_stats dataset.
This dataset contains 24,691 rows and 53 columns; the major features include Pos (position), Age, G (games played), PTS (points scored), and MP (minutes played).
## 'data.frame': 24691 obs. of 53 variables:
## $ X : int 0 1 2 3 4 5 6 7 8 9 ...
## $ Year : int 1950 1950 1950 1950 1950 1950 1950 1950 1950 1950 ...
## $ Player: chr "Curly Armstrong" "Cliff Barker" "Leo Barnhorst" "Ed Bartels" ...
## $ Pos : chr "G-F" "SG" "SF" "F" ...
## $ Age : int 31 29 25 24 24 24 22 23 28 28 ...
## $ Tm : chr "FTW" "INO" "CHS" "TOT" ...
## $ G : int 63 49 67 15 13 2 60 3 65 36 ...
## $ GS : int NA NA NA NA NA NA NA NA NA NA ...
## $ MP : int NA NA NA NA NA NA NA NA NA NA ...
## $ PER : num NA NA NA NA NA NA NA NA NA NA ...
## $ TS. : num 0.368 0.435 0.394 0.312 0.308 0.376 0.422 0.275 0.346 0.362 ...
## $ X3PAr : num NA NA NA NA NA NA NA NA NA NA ...
## $ FTr : num 0.467 0.387 0.259 0.395 0.378 0.75 0.301 0.313 0.395 0.48 ...
## $ ORB. : num NA NA NA NA NA NA NA NA NA NA ...
## $ DRB. : num NA NA NA NA NA NA NA NA NA NA ...
## $ TRB. : num NA NA NA NA NA NA NA NA NA NA ...
## $ AST. : num NA NA NA NA NA NA NA NA NA NA ...
## $ STL. : num NA NA NA NA NA NA NA NA NA NA ...
## $ BLK. : num NA NA NA NA NA NA NA NA NA NA ...
## $ TOV. : num NA NA NA NA NA NA NA NA NA NA ...
## $ USG. : num NA NA NA NA NA NA NA NA NA NA ...
## $ blanl : logi NA NA NA NA NA NA ...
## $ OWS : num -0.1 1.6 0.9 -0.5 -0.5 0 3.6 -0.1 -2.2 -0.7 ...
## $ DWS : num 3.6 0.6 2.8 -0.1 -0.1 0 1.2 0 5 2.2 ...
## $ WS : num 3.5 2.2 3.6 -0.6 -0.6 0 4.8 -0.1 2.8 1.5 ...
## $ WS.48 : num NA NA NA NA NA NA NA NA NA NA ...
## $ blank2: logi NA NA NA NA NA NA ...
## $ OBPM : num NA NA NA NA NA NA NA NA NA NA ...
## $ DBPM : num NA NA NA NA NA NA NA NA NA NA ...
## $ BPM : num NA NA NA NA NA NA NA NA NA NA ...
## $ VORP : num NA NA NA NA NA NA NA NA NA NA ...
## $ FG : int 144 102 174 22 21 1 340 5 226 125 ...
## $ FGA : int 516 274 499 86 82 4 936 16 813 435 ...
## $ FG. : num 0.279 0.372 0.349 0.256 0.256 0.25 0.363 0.313 0.278 0.287 ...
## $ X3P : int NA NA NA NA NA NA NA NA NA NA ...
## $ X3PA : int NA NA NA NA NA NA NA NA NA NA ...
## $ X3P. : num NA NA NA NA NA NA NA NA NA NA ...
## $ X2P : int 144 102 174 22 21 1 340 5 226 125 ...
## $ X2PA : int 516 274 499 86 82 4 936 16 813 435 ...
## $ X2P. : num 0.279 0.372 0.349 0.256 0.256 0.25 0.363 0.313 0.278 0.287 ...
## $ eFG. : num 0.279 0.372 0.349 0.256 0.256 0.25 0.363 0.313 0.278 0.287 ...
## $ FT : int 170 75 90 19 17 2 215 0 209 132 ...
## $ FTA : int 241 106 129 34 31 3 282 5 321 209 ...
## $ FT. : num 0.705 0.708 0.698 0.559 0.548 0.667 0.762 0 0.651 0.632 ...
## $ ORB : int NA NA NA NA NA NA NA NA NA NA ...
## $ DRB : int NA NA NA NA NA NA NA NA NA NA ...
## $ TRB : int NA NA NA NA NA NA NA NA NA NA ...
## $ AST : int 176 109 140 20 20 0 233 2 163 75 ...
## $ STL : int NA NA NA NA NA NA NA NA NA NA ...
## $ BLK : int NA NA NA NA NA NA NA NA NA NA ...
## $ TOV : int NA NA NA NA NA NA NA NA NA NA ...
## $ PF : int 217 99 192 29 27 2 132 6 273 140 ...
## $ PTS : int 458 279 438 63 59 4 895 10 661 382 ...
Filtering was used to obtain a dataset of players’ stats for the year 2016.
Per-game features such as MPG, PPG, and APG were then added with mutate so that the regression analysis could be performed on a per-game basis.
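As a worked example of the per-game calculation, the sketch below uses the first three rows shown in the player_stats output above (G, PTS, and AST). It is written in base R so that it runs on its own; the data frame name `ex` is illustrative.

```r
# First three rows of player_stats as printed by str() above
ex <- data.frame(
  Player = c("Curly Armstrong", "Cliff Barker", "Leo Barnhorst"),
  G      = c(63, 49, 67),      # games played
  PTS    = c(458, 279, 438),   # total points
  AST    = c(176, 109, 140)    # total assists
)

# Per-game features: season total divided by games played
ex$PPG <- ex$PTS / ex$G
ex$APG <- ex$AST / ex$G

# Rounded for display only
round(ex[, c("PPG", "APG")], 2)
```

The same division by G is what produces MPG, RPG, TOPG, BPG, and SPG from their season totals.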
The filtered data can be viewed here:
The merge of the filtered_2016 dataset and the season_salary dataset can be viewed here.
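A point worth keeping in mind when merging: base R's `merge` keeps only the rows whose key appears in both data frames unless `all = TRUE` is given. The sketch below shows this on hypothetical toy data; the player labels and values are made up.

```r
# Hypothetical toy data: player "A" has no salary row, player "D" has no stats row
stats_toy  <- data.frame(Player = c("A", "B", "C"),
                         PPG    = c(10.2, 5.1, 20.3))
salary_toy <- data.frame(Player      = c("B", "C", "D"),
                         season17_18 = c(1500000, 20000000, 900000))

# Default merge: inner join, only players present in both data frames survive
inner <- merge(stats_toy, salary_toy, by = "Player")

# all = TRUE: full outer join, unmatched rows are kept and filled with NA
full <- merge(stats_toy, salary_toy, by = "Player", all = TRUE)

nrow(inner)
nrow(full)
```

With the default, players who appear in only one of the two datasets are silently dropped from the merged data.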
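The following correlation plot checks the correlation between salary and the other candidate variables.
The following scatterplot matrix shows the pairwise relationships between salary and the selected features.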
The table shows the correlation of each of the following variables with salary (filter_sal).
## filter_sal PPG MPG TOPG RPG PER SPG
## 1.0000000 0.6806714 0.6149170 0.5301139 0.5333931 0.5454683 0.4429916
## APG
## 0.3883572
Correlation strength is: PPG > MPG > PER > RPG > TOPG > SPG > APG
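The ordering can be checked by sorting the correlation values printed in the table above. The vector name `cors` is illustrative; the numbers are copied from that output.

```r
# Correlations with filter_sal, copied from the table above
cors <- c(PPG = 0.6806714, MPG = 0.6149170, TOPG = 0.5301139,
          RPG = 0.5333931, PER = 0.5454683, SPG = 0.4429916,
          APG = 0.3883572)

# Sort from strongest to weakest correlation
ranked <- sort(cors, decreasing = TRUE)
names(ranked)
```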
Interestingly, the number of turnovers a player makes is positively correlated with salary.
This can be interpreted as follows:
Players who commit more turnovers tend to be more involved in the game, because they handle the ball more often.
On this reading, players who make more turnovers are directly important to their team. This can be expressed as “aggressiveness” (the Agressiveness factor in the model).
This assumption is used as a stand-in for how long a player holds the ball in possession during a game.
The following boxplot shows the distribution of points scored per game. The median is about 8, and there are multiple outliers, concentrated in the range 25-30.
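The median and the outliers that a boxplot displays can also be computed directly. The sketch below uses a hypothetical PPG vector, not the real data, and base R's `boxplot.stats`, which applies the usual rule of flagging points more than 1.5 times the interquartile range beyond the hinges.

```r
# Hypothetical PPG values (not from the NBA dataset)
ppg <- c(2.1, 3.5, 4.8, 6.0, 7.2, 7.9, 8.0, 8.4, 9.6, 11.2, 13.0, 27.4, 30.1)

# Five-number summary and outliers as a boxplot would draw them
bs <- boxplot.stats(ppg)

median_ppg <- bs$stats[3]   # middle line of the box
outliers   <- bs$out        # points drawn beyond the whiskers

median_ppg
outliers
```

Calling `boxplot(ppg)` on the same vector draws the corresponding plot.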
Hovering over the highest-valued data point gives the following information for that point.
##
## Call:
## lm(formula = filter_sal ~ ., data = salary_regression)
##
## Coefficients:
## (Intercept) MPG PPG APG RPG TOPG
## -2335560 -64693 998724 1142968 839990 -3937386
## BPG SPG
## 1865318 836451
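Note that in this model the TOPG coefficient is negative (-3,937,386): with the other per-game statistics held fixed, more turnovers are associated with a lower salary. The output therefore does not prove the assumption that a player who makes more turnovers is paid more; the positive pairwise correlation between TOPG and salary (0.53) is consistent with turnovers acting as a marker of involvement rather than as a direct driver of salary.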
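In the output above, the TOPG coefficient is negative even though the pairwise correlation between TOPG and salary is positive. The sketch below shows on synthetic data how both can hold at once. Everything in it (the seed, the sample size, the effect sizes) is an assumption chosen for illustration and is not estimated from the NBA data.

```r
set.seed(1)   # arbitrary seed, for reproducibility only
n <- 200

# Synthetic data: players with more minutes also commit more turnovers
MPG  <- runif(n, min = 5, max = 38)
TOPG <- 0.08 * MPG + rnorm(n, sd = 0.4)

# Salary rises with minutes but, at fixed minutes, falls with turnovers
salary <- 500000 * MPG - 1000000 * TOPG + rnorm(n, sd = 1e6)

# Pairwise correlation ignores minutes played
pairwise <- cor(salary, TOPG)

# Regression coefficient holds minutes played fixed
partial <- coef(lm(salary ~ MPG + TOPG))[["TOPG"]]

pairwise
partial
```

The pairwise correlation picks up the effect of minutes played, while the regression coefficient measures turnovers with minutes held fixed, so the two can have opposite signs.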
Finding the salary of an average player is necessary for determining the salary increase once the Trusted and Agressiveness factors are taken into consideration.
An average player receives the following salary per season:
## [1] 8271766
Here we fit a multiple regression model based on the factors Trusted and Agressiveness.
We can assess the model fit from the coefficients of the features:
##
## Call:
## lm(formula = filter_sal ~ Trusted * Agressiveness, data = salary_regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13221467 -3611877 -1639386 3916099 20950162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3973906 518474 7.665 1.66e-13 ***
## TrustedYes 4609276 1034546 4.455 1.12e-05 ***
## AgressivenessYes 1607136 1595830 1.007 0.3146
## TrustedYes:AgressivenessYes 3542071 1916651 1.848 0.0654 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6579000 on 363 degrees of freedom
## Multiple R-squared: 0.3107, Adjusted R-squared: 0.305
## F-statistic: 54.55 on 3 and 363 DF, p-value: < 2.2e-16
Obtaining Coefficients for the features.
## (Intercept) TrustedYes
## 3973906 4609276
## AgressivenessYes TrustedYes:AgressivenessYes
## 1607136 3542071
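The four coefficients translate into one predicted salary for each combination of Trusted and Agressiveness. The sketch below does that arithmetic with the coefficient values copied from the output above; the names `b`, `grid`, `is_trusted` and `is_aggr` are illustrative.

```r
# Coefficients copied from the model output above
b <- c("(Intercept)"                 = 3973906,
       "TrustedYes"                  = 4609276,
       "AgressivenessYes"            = 1607136,
       "TrustedYes:AgressivenessYes" = 3542071)

# All four combinations of the two Yes/No factors
grid <- expand.grid(Trusted = c("No", "Yes"), Agressiveness = c("No", "Yes"))
is_trusted <- grid$Trusted == "Yes"
is_aggr    <- grid$Agressiveness == "Yes"

# Predicted salary: intercept plus whichever terms apply to the group
grid$predicted <- b[["(Intercept)"]] +
  b[["TrustedYes"]] * is_trusted +
  b[["AgressivenessYes"]] * is_aggr +
  b[["TrustedYes:AgressivenessYes"]] * (is_trusted & is_aggr)

grid
```

A player who is neither trusted nor aggressive is predicted at the intercept, 3,973,906, while a player who is both is predicted at the sum of all four terms, 13,732,389.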
Results
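Being trusted is directly associated with a higher salary: in the model above, the TrustedYes coefficient (4,609,276) is significant, while AgressivenessYes on its own is not (p = 0.3146).
More trust leads to more playing time from the coach and a higher probability of scoring more points per game (PPG).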
The salary will increase by $311,416.
Let’s try to predict the salary of the Most Valuable Player of the 2016-17 season. Russell Westbrook played for the Oklahoma City Thunder and had a sensational season in both the regular season and the playoffs.
His performance statistics are taken from Basketball Reference: Russell Westbrook.
Russell Westbrook’s statistics for the season were:
Points per Game (PPG): 31.6
Minutes played per Game (MPG): 34.6
Turnovers per Game (TOPG): 5.4
Based on Russell Westbrook’s stats, we fit a prediction model and compute the predicted salary.
## [1] "PPG:31.6,MPG:34.6,TOPG:5.4 ==> Expected Salary for player: $25,707,363"
According to the prediction model, he will get $25,707,363 next season.
The official salary Russell Westbrook received the next season was $28,530,608.
The prediction is therefore about 90 percent of the actual salary, a relative error of about 9.9 percent.
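How close the prediction comes can be computed directly from the two salary figures above. The variable names are illustrative, and the ratio of predicted to actual salary is only one of several possible accuracy measures.

```r
predicted <- 25707363   # model prediction quoted above
actual    <- 28530608   # official salary quoted above

# Prediction as a share of the actual salary, in percent
pct_of_actual <- round(100 * predicted / actual, 1)

# Relative error of the prediction, in percent
pct_error <- round(100 * abs(actual - predicted) / actual, 1)

pct_of_actual
pct_error
```

On these figures the prediction falls short of the actual salary by $2,823,245.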
Our Assessment Study of this Dataset provided many deep insights about the dataset. Based on our study:
[Github.com](https://github.com/rstudio/flexdashboard)
[RStudio Website](https://shiny.rstudio.com/gallery/)
[LAB files and Homework files - STAT-6020](https://clemson.app.box.com/s/ufxtp6yjv6ov8js5g3djmd7zwgxvd6ed)
[Kaggle.com](https://www.kaggle.com/datasets/koki25ando/salary?select=NBA_season1718_salary.csv)
[Basketball Reference: Russell Westbrook](https://www.basketball-reference.com/players/w/westbru01/gamelog/2017)
[Medium.com](https://towardsdatascience.com/predicting-housing-prices-with-r-c9ec0821328d)
[Rpubs.com](https://rpubs.com/ishantnayer/234221)
I am a graduate student at Clemson University pursuing a Master’s degree in Computer Science. I completed a Bachelor’s in Computer Engineering in Mumbai, India, and developed a deep interest in data visualization after an internship creating dashboards with the Streamlit package in Python. This was one of the key reasons for taking the course on Statistical Computing in R, which helped me learn EDA and visualization along with the basics of hypothesis testing in R.
Over the last few years, I have learned to work with datasets containing quantitative and qualitative variables. This project, however, gave me insight into predictive modelling and how it can sharpen the analysis in a field of research.
#LOADING PACKAGES
library(data.table) #Enhanced dataframe
library(corrplot) #For plotting correlation plot
library(GGally) #Alternative Package(An Extension) for ggplot2
library(tidyverse) #Package that we can use for data analysis
library(PerformanceAnalytics) # Package for Performance, Prediction Analysis
library(plotly) #For improving Graphics on plots
library(rstatix) #Package for performing statistical tests
# Dataset is read
##First dataset
season_salary <-
read.csv("NBA_season1718_salary.csv")
season_salary
##Second dataset
player_stats <- read.csv("Seasons_Stats.csv")
player_stats
#displaying datatypes of variables
#datatypes for dataset 1
str(season_salary)
#datatypes for dataset 2
str(player_stats)
#Filtering dataset for receiving datapoints for the year: 2016.
#Selecting relevant variables like Year, G, PER, FG, PTS etc.
#Mutating columns named MPG, RPG, PPG, APG, TOPG, BPG, SPG for further analysis
filtered_2016 <-
player_stats %>% filter(Year == 2016) %>%
select(Year:G, MP, PER, FG:PTS) %>%
distinct(Player, .keep_all = TRUE) %>%
mutate(MPG = MP/G, PPG = PTS/G, APG = AST/G,
RPG = TRB/G, TOPG = TOV/G, BPG = BLK/G,
SPG = STL/G)
# Performing an inner join (the default for merge) of the filtered_2016 and season_salary datasets on the common variable "Player".
salary_new <- merge(filtered_2016, season_salary, by.x = "Player", by.y = "Player")
# Renaming a column for salary
names(salary_new)[40] <- "filter_sal"
#Deleting unwanted column
salary_new <- salary_new[-39]
#Generating correlation plot for salary with other influential variables
#to check whether these variables have correlation.
corrplot(cor(salary_new %>%
select(filter_sal, MPG:SPG,
Age, PER, contains("%")),
use = "complete.obs"),
method = "number",type = "upper")
## Plotting scatterplot matrix for correlation for features with salary variables.
# Made use of ggpairs function.
cor_salary_new <-
salary_new %>%
select(filter_sal, PPG, MPG, TOPG, RPG, PER, SPG, APG)
ggpairs(cor_salary_new)
#Correlation table
cor(cor_salary_new)[,"filter_sal"]
#Generation of Interactive plot for plotting influence of salary received by the player over points scored per game.
#plotly function is used
#hoverinfo argument provides useful technique in generating a plot with dynamic nature. Extremely useful for plotting real-time data.
names(salary_new)[5] <- "Team"
plot_ly(data = salary_new, x = ~filter_sal, y = ~PPG, color = ~Team,
hoverinfo = "text",
text = ~paste("Player: ", Player,
"<br>Salary: ", format(filter_sal, big.mark = ","),"$",
"<br>PPG: ", round(PPG, digits = 3),
"<br>Team: ", Team)) %>%
layout(
title = "Salary vs Point Per Game",
xaxis = list(title = "Player's Salary in USD"),
yaxis = list(title = "Point per Game for Player")
)
#Plotting scatterplot of regression Model for salary variable on Points Scored per Game to see variance in the graph
salary_new %>%
ggplot(aes(x = filter_sal, y = PPG)) +
geom_point() +
geom_smooth(method = "lm")
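#Building the regression dataset: salary plus the per-game features.
#NOTE: salary_regression is not defined in the original listing; this definition is
#reconstructed from the lm(filter_sal ~ ., data = salary_regression) output shown earlier.
salary_regression <- salary_new %>% select(filter_sal, MPG:SPG)
#For determining the effect of Trust on Turnovers.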
#Obtaining average for MPG feature using mean function
avg.minutes <- mean(salary_regression$MPG)
#Obtaining average for TOPG feature using mean function
avg.turnover <- mean(salary_regression$TOPG)
#Converting categorical variable into factors for Yes and No values
salary_regression$Trusted <- as.factor(ifelse(salary_regression$MPG >= avg.minutes, "Yes", "No"))
#Converting categorical variable into factors for Yes and No values
salary_regression$Agressiveness <- as.factor(ifelse(salary_regression$TOPG >= avg.turnover, "Yes", "No"))
#Printing first 6 rows of the dataset (the default for head)
head(salary_regression)
# Generating regression plot to determine any difference in salary based on Agressiveness
salary_regression %>%
ggplot(aes(x = filter_sal, y = PPG, colour = Agressiveness)) +
geom_point() +
geom_smooth(method="lm")
#Getting average player's salary by finding the mean for all players
mean_salary <- mean(salary_regression$filter_sal)
mean_salary
#Fitting a multiple regression model for salary based on Trusted and Agressiveness
trust_lm <- lm(formula = filter_sal ~ Trusted * Agressiveness, data=salary_regression)
#Plotting summary table for our regression model
summary(trust_lm)
# Coefficients Table
coef(trust_lm)
#Building a prediction function for predicting value based on PPG, MPG, TOPG.
# Printing output value for our function
salary_prediction <- function(m, point, minutes, turn_over){
pre_new <- predict(m, data.frame(PPG = point, MPG = minutes, TOPG = turn_over))
msg <- paste("PPG:", point, ",MPG:", minutes, ",TOPG:", turn_over, " ==> Expected Salary for player: $", format(round(pre_new), big.mark = ","), sep = "")
print(msg)
}
# Fitting values for obtaining output
model <- lm(formula = filter_sal ~ PPG + MPG + TOPG, data = salary_regression)
salary_prediction(model, 31.6, 34.6, 5.4)
Clemson University School of Computing.
Prof. Ellen Breazel, Teacher in STAT-6020 Intro to Statistical Computing - Spring 2022 at Clemson University.